ABSTRACT

We present LDAExplore, a tool for visualizing the topic
distributions that topic modeling methods generate for a given
document corpus. Latent Dirichlet Allocation (LDA) is one of
the most widely used methods for generating topics. One problem
with methods like LDA is that users who apply them may not
understand the generated topics. Users may also find it
difficult to search for correlated topics and correlated
documents. LDAExplore tries to alleviate these problems by
visualizing the topic and word distributions generated from the
document corpus and allowing the user to interact with them.
The system is designed for users who have minimal knowledge of
LDA or topic modeling methods. To evaluate our design, we run a
pilot study that uses the abstracts of 322 Information
Visualization papers, where every abstract is considered a
document. The topics generated are then explored by users. The
results show that users are able to find correlated documents
and group them based on similar topics.
INTRODUCTION

Topic modeling tries to automate the process of extracting
topics from documents while also annotating them with semantic
information [3]. It comprises a set of statistical algorithms
that extract correlated words from documents. These extracted
word sets are called topics. Later, users can annotate each
topic with semantic information. For example, consider an
extracted word set containing words such as visualization,
sets, clusters, infoviz, and interfaces. Knowing the words
associated with this topic, we have a general notion that the
word set represents the topic Information Visualization. The
“topic” generated is the word set, and Information
Visualization is the semantic “topic name” annotated by the
user. Latent Dirichlet Allocation (LDA) [4] is one of the
common methods for performing topic modeling on a given corpus
of documents.

LDA generates two types of distributions: the topic
distribution for each document in the set and the word
distribution for each topic. These distributions can be changed
by tweaking the hyper-parameters. LDAExplore tries to give
visual cues about how these distributions look and how the
topics and documents are interrelated at the corpus level,
between groups, or between individual documents. It is designed
for users who may not know what topic modeling algorithms do
but who are concerned with understanding the document corpus
and finding its hidden topics. LDAExplore also gives users the
ability to search for documents using keywords from topics.

One of the basic requirements of the design is that the visual
should scale to a large set of documents while providing the
ability to see individual and group relations. We assume that
the number of topics is much smaller than the number of
documents.

One of the primary problems with topic modeling methods is that
the “topic” they generate may not be clearly understood by the
user [8]. One solution to this problem is to introduce a
human-in-the-loop paradigm in which users interact with the
algorithm, providing feedback so that the underlying model can
be modified to generate “better” topics.

Our contributions in this paper are: 

• A comprehensive set of tasks required to design
visualizations for text analysis using LDA.

• A visualization that shows correlations between topics and
documents. The design can be scaled up to a larger set of
documents without applying any aggregation method, such as
clustering, to the documents.

• A combination of visual search and filtering that lets users
filter in parallel across multiple topics, so that they can
easily filter a large document corpus.

• A pilot study that shows the usability of each part of the
tool.

RELATED WORK 

There are many different ways to visualize topic models. The
methods include force-directed graphs, parallel coordinates,
matrices, and tree designs.

Using Matrix Designs 

Matrix or tabular designs are easily understood by users.
Termite uses a tabular layout to promote comparison of terms
both within and across latent topics. Its primary visualization
is a matrix view where rows correspond to terms and columns to
topics [2]. Another visualization [7] uses a navigator to
explore LDA-generated topics. It shows the words in each topic
and uses a tabular form to show which documents are associated
with it. Once a specific document is selected, the user can
navigate to its Wikipedia article. These matrix layouts show
the correlation between terms and topics but have difficulty
scaling to a large number of documents. They also do not show
how strongly a topic is correlated with a document, that is,
its likelihood.

Using Trees & Hierarchical Clustering

RoseRiver is another analytics system used to visualize how
topics evolve [10]. The system uses a tree cut approach in
combination with a word cloud. The word cloud is a standard
method for showing a word set, with word sizes varying
according to their frequency (or probability). Similarly,
Overview [6] is a technique designed for analyzing large text
corpora using TF-IDF. It generates document clusters by
hierarchically clustering the resulting document distances and
encoding the result as a topic tree. VarifocalReader [15] uses
a hierarchical structure that shows chapters, sub-chapters,
pages, and then lines of text. It also uses TF-IDF as one of
its techniques. Serendip [1] is a technique designed for
analyzing large text corpora using LDA. It presents the results
in a rectangle where each row represents a document and each
column represents a topic.

Tree layouts can scale in design, but when combined with
hierarchical clustering they increase the amount of
preprocessing necessary to present the tree structure. We
choose not to perform any clustering because user-defined
clusters and annotations change the nature of the distributions
generated by LDA.

Using Force Directed Graphs & Parallel Coordinates 

UTOPIAN [9] is a semi-supervised system that uses a
force-directed graph to represent topics. It uses non-negative
matrix factorization. Other visualizations include
iVisClustering [16] and ParallelTopics [11]. Another parallel
coordinates design, which works directly with the words in the
topic, is ThemeDelta [12]. ParallelTopics and ThemeDelta both
use parallel coordinates, whereas iVisClustering uses a
force-directed graph approach. The Topic-based, interactive
visual analysis tool (TIARA) [17] shows topic distributions
across documents over time. Force-directed graphs are
advantageous in terms of ease of understanding and usability,
but they consume a lot of screen space and may not scale to a
large number of terms or documents.

LDAExplore tries to address a number of challenges: a method
that can scale to an increasing set of documents, a way to
quantify the relation between documents and topics, and an easy
way to filter documents and topics while preserving their
interrelations. In LDAExplore, we try to reduce the amount of
preprocessing necessary to display the topics, so that changes
to the underlying model can be performed efficiently. We use
parallel coordinates to show document and topic relations, and
integrate keyword search so that the user can easily search for
patterns using words from topics. This makes the filtering
process more intuitive for the user.

DESIGN CONSIDERATIONS 
Task Analysis 

As described in the previous section, the results obtained from
LDA can be unintuitive. One of the main purposes of the
visualization is to let users explore the document set and give
feedback to the system about topics and about which topics are
correlated with which documents, so that the word and topic
distributions become more intuitive and insightful. The
following set of tasks forms the basis for our design.

Visualize Topics 

LDA generates a set of topics, each having its own word list.
Each word has a probability of being associated with the topic.
As shown in figure 3, when the user interacts with a specific
topic, the top words in the topic are displayed along with
their probabilities. This is one way for the user to identify
what the topic is.

Overview of Document - Topic Relations 
Once the user has a general idea of the semantics of the
topics, the user can focus on the correlations between
documents and topics. The main purpose is to see which topics
are ranked higher over a large subset of documents and which
topics have a lower rank. The number of documents that rank a
topic high gives the user an idea of which topics are more
important.
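
The count described above can be sketched in a few lines. This is only an illustration, not LDAExplore's implementation: the function name `count_high_ranked`, the cut-off `top_r`, and the toy ranks are all hypothetical, and the per-document ranks are assumed to be already available.

```python
from collections import Counter

def count_high_ranked(doc_topic_ranks, top_r=3):
    """Count, per topic, the documents that rank it within the top `top_r`.

    doc_topic_ranks maps a document ID to {topic ID: rank}, rank 1 being
    the most probable topic. `top_r` is an assumed cut-off for "high".
    """
    counts = Counter()
    for ranks in doc_topic_ranks.values():
        for topic, rank in ranks.items():
            if rank <= top_r:
                counts[topic] += 1
    return counts

# Toy ranks for three documents and three topics (made-up values).
doc_topic_ranks = {
    "D0": {"T1": 1, "T2": 2, "T3": 3},
    "D1": {"T1": 1, "T2": 3, "T3": 2},
    "D2": {"T1": 2, "T2": 1, "T3": 3},
}
counts = count_high_ranked(doc_topic_ranks, top_r=1)
most_important = counts.most_common(1)
```

On the parallel coordinates, the same information appears as the density of lines near the top of each topic axis.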

Remove Topics from the Visual 

When the user knows which topics are important, excess topics
can be removed from the visual.

Filtering Documents 

When the corpus is large, the user will try to understand the
distribution within a subset of the documents. Hence the user
can filter the documents based on various criteria:

1. By the top words in each document, that is, the words
associated with the topics that have the highest probability
of being related to the document.

2. By name or title. The user may not know all the documents
but may be able to identify some of them by name or title, and
can thus filter out individual documents.

Figure 1: Displaying the Topic Distribution

Perform set operations 

While working with groups of documents, the user requires the
ability to perform set operations such as Include and Exclude.
Include lets the user add a filtered set of documents to future
filters. Exclude does the exact opposite: it removes the
document set from any future filter operations. Once the
document corpus has been explored, the user can export the
filtered data for any post-processing that is performed
separately.

Show & Cluster Similar Topics 

Once the word distributions for each topic are known, the user
will look for topics that are similar. Topic similarity can be
measured using methods such as KL-Divergence [20]. Similar
topics can be grouped together, so that the number of topics on
the visual display is reduced and hierarchical collections of
topics can be formed.
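
As a rough illustration of measuring topic similarity with KL-Divergence, the sketch below compares topic-word distributions stored as dictionaries. The symmetrized form, the smoothing constant `eps`, and the toy topics are assumptions made for this example; the text does not specify which variant of the divergence would be used.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two word distributions given as {word: probability}.

    `eps` is an assumed smoothing constant that keeps the logarithm
    finite when a word has zero probability under q.
    """
    return sum(
        p_w * math.log((p_w + eps) / (q.get(word, 0.0) + eps))
        for word, p_w in p.items()
        if p_w > 0.0
    )

def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); smaller means more similar."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Toy topic-word distributions (made-up values).
topic_a = {"design": 0.5, "space": 0.3, "users": 0.2}
topic_b = {"design": 0.4, "space": 0.4, "users": 0.2}
topic_c = {"volume": 0.6, "flow": 0.3, "design": 0.1}

close = symmetric_kl(topic_a, topic_b)
far = symmetric_kl(topic_a, topic_c)
```

Plain KL-Divergence is asymmetric, so averaging both directions is one simple way to obtain a score that can be used for grouping; other symmetric measures would serve equally well.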

Perform Cluster Operations 

Similar topics lead to a new group of documents that share
these topics. The main task for the visual is to enable users
to define groups of documents and topics based on their own
knowledge.

Annotating Topics 

Users should have the option to annotate the topics and
documents, or the respective clusters they lie in.

Prototype Design 

As described before, various set, group, and LDA visualizations
are available. They can display a very basic summarization of
the document corpus and reveal basic relationships that occur.
They provide the ability to explore individual documents, but
they lack the ability to show document correlations across the
dataset. When the correlations are shown, the design may not
scale to a large number of documents. LDAExplore is designed to
handle a sizeable number of documents while providing the
capability to filter and visualize topics for individual
documents.

In its present form, LDAExplore implements a subset of the 
tasks: 


• Visualizing topics 

• Filtering & Set operations 

• Displaying the overview of document and topic correlations

• Removing topics. 

Based on the set of tasks defined, we describe how LDAExplore
works.

Visualizing Topics 

We use a treemap to visualize the topics. Figure 2 shows how
the topics are displayed. Each rectangle is a topic and is
given an ID T1...Tn, where n is the number of defined topics.
In figure 1, the total number of topics is 20. In the treemap,
the size of a rectangle represents the probability of the topic
being associated with the document. In the current design, the
likelihood is defined for each individual topic with respect to
each document, rather than the document collection as a whole.
Hence in figure 2 the topics are represented by rectangles of
equal area (but possibly different dimensions, depending on the
number of topics), showing that all topics have an equal
likelihood.

When the user clicks on a specific topic in the treemap, the
top 10 words associated with the topic are revealed. Figure 3
shows an example word distribution. The word with the highest
probability has the largest area. In this example, the word
data has the largest probability for topic T4. In the current
design, the number of words displayed per topic is fixed at 10.
The user can return to the topics by clicking on the Topics tab
at the top of the treemap, so whenever the user drills down
into a word distribution, the parent is always known.

Relating Topics & Documents 

We use parallel coordinates [14] [5] to show the correlation
between topics and documents. The parallel coordinates have two
types of axes. The first type represents the documents. The
second type represents topics. For example, consider Figure 1,
which shows a total of 21 axes. The first axis, ID, shows all
the documents in the corpus and ranges from document ID D0 to
D322. Each of the next 20 axes represents a topic. Once LDA
generates the probability of each topic in a document, we rank
the topics. The topic axes show the rank of the topic with
respect to the document. So in figure 1, document D0 has topic
T1 at rank 1, while others have it at ranks 4, 7, 16, and 20
respectively. Each document has a different color, so that the
user can distinguish between the graphs of various documents.
The ranks are in ascending order, with 1 showing the highest
degree of correlation and n (where n is the number of topics)
showing zero or the least correlation. Topic ranks tell the
user which topic is the most correlated with a document and
which is the least.

Figure 2: LDAExplore: Filtering Mechanisms

Various sections of LDAExplore are shown: the parallel
coordinates, which show the topic distribution; the treemap,
which shows the word distribution for each topic on drill-down;
and the search tab used for filtering. Annotations in the
figure mark the document IDs, the number of filtered documents,
the topic ranks for a document, the generated topics, the
highlighted topic, the documents after filtering, the line
opacity (81.4%), the Search Documents on Top Words box (query:
volume, large), and the documents with their top words. A
single filter is applied across the axis T17 to isolate
documents that rank T17 highly.

Figure 3: Top 10 words in Each Topic

In LDAExplore, we decided to use parallel coordinates in order
to display the maximum number of topics without cluttering the
screen. When we tried to design other types of visualization,
we quickly realized (by doing a quick claim analysis) that
scalability is an issue. In addition to the parallel
coordinates, we have a treemap to show the words in each topic.
Being able to see the words in each topic is important because
it allows the user to easily understand what that topic is
about.

Filtering 

LDAExplore provides a range of filtering options. Figure 2 
provides a detailed overview of the filtering features. There 
are three main types of filtering: 

1. Filter by Range - Each axis can be used to filter the
documents by selecting a specific range on the axis. The
parallel coordinates display the curves only for the selected
documents. Several axes can be filtered simultaneously, thus
creating a filter chain.

2. Filter by Searching - A search bar lets the user search the
documents using the top n words associated with each document.
The current design allows the user to search using any of the
top 10 words.

3. Filter by Selecting Individual Documents - The documents
column lists each document in the corpus by its title. The
user can click on a specific document to filter it out.
Whenever a document is filtered out, its color is changed to
downplay the document.
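
The first two filters in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: the record layout (`ranks`, `top_words`) and both function names are hypothetical, and the keyword search is assumed to require every query word to appear among a document's top words.

```python
def filter_by_range(docs, ranges):
    """Keep documents whose rank lies inside the selected range on every
    filtered axis. `ranges` maps a topic ID to an inclusive (low, high)
    pair; filtering several axes at once forms the filter chain."""
    return [
        doc for doc in docs
        if all(low <= doc["ranks"][topic] <= high
               for topic, (low, high) in ranges.items())
    ]

def filter_by_keywords(docs, keywords):
    """Keep documents whose top words contain every query keyword
    (an assumed AND semantics)."""
    wanted = {word.strip().lower() for word in keywords}
    return [doc for doc in docs if wanted <= set(doc["top_words"])]

# Toy documents with made-up ranks and top words.
docs = [
    {"id": "D0", "ranks": {"T1": 1, "T17": 5},
     "top_words": ["design", "space", "volume", "large"]},
    {"id": "D1", "ranks": {"T1": 4, "T17": 2},
     "top_words": ["volume", "large", "flow"]},
    {"id": "D2", "ranks": {"T1": 2, "T17": 1},
     "top_words": ["design", "users", "time"]},
]

chained = filter_by_range(docs, {"T17": (1, 3), "T1": (1, 4)})
searched = filter_by_keywords(chained, ["volume", "large"])
chained_ids = [doc["id"] for doc in chained]
searched_ids = [doc["id"] for doc in searched]
```

Because each filter returns a plain list, the range filter and the keyword search can be applied in either order and combined freely.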

Once the user has filtered the documents, the user can
highlight a specific document within the filtered document set
by hovering the mouse over the document-word set (column 3 in
figure 2). The total number of documents in the filtered set is
given at the top, on the navigation bar. The user can also
export the filtered data to CSV format.

Set Operations 

LDAExplore supports two main set operations: include and
exclude. The include operation is performed using Keep and the
exclude operation using Exclude, as seen in figure 2.
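
One plausible reading of the two operations, expressed on sets of document IDs, is sketched below. The function names and the interpretation of Keep as restricting the working set to the filtered documents are assumptions made for illustration.

```python
def keep(working_set, filtered):
    """Keep: later filters operate only on the currently filtered documents."""
    return working_set & filtered

def exclude(working_set, filtered):
    """Exclude: the filtered documents are dropped from all later filters."""
    return working_set - filtered

# Toy corpus of four document IDs and a filtered subset (made-up values).
corpus = {"D0", "D1", "D2", "D3"}
filtered = {"D1", "D2"}

kept = keep(corpus, filtered)
remaining = exclude(corpus, filtered)
```

Under this reading the two results are complementary: together they always reconstruct the working set.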

Generating Top Words 

Once the documents are processed using LDA (from the gensim
library [19]), we end up with a list of topics and, for each
topic, the probability of each word being associated with it.
A minimum probability threshold is applied to each topic before
it is considered in the final distribution. Topics below the
minimum threshold are assigned a probability of 0. To generate
the top words for each document, we compute the following:


P(w_x, d_y) = \sum_{i=1}^{n} P(w_x, t_i) \times P(t_i, d_y)    (1)

where P(w_x, d_y) is the probability of a word w_x with respect
to a document d_y, P(w_x, t_i) is the probability of the word
with respect to a topic t_i, P(t_i, d_y) is the probability of
the topic with respect to the document, and n is the number of
topics.

Equation 1 calculates the probability of each word in every
topic for a specific document and then finds the top k words
that have the highest probability of being associated with the
given document. This set of words is shown as the top n
searchable words in the visualization. The words are also used
to create the topic-word hierarchy that is used in the treemap
visualization for topics.
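
A small sketch of the top-word computation follows. It assumes that each word's score is its per-topic probability weighted by the topic's probability in the document and summed over topics; the function name, the dictionary layout, and the parameter `min_topic_prob` are hypothetical, and the numbers are made up.

```python
def top_words_for_document(topic_word, doc_topic, k=10, min_topic_prob=0.0):
    """Score each word by summing P(word, topic) * P(topic, document)
    over the topics, then return the k best (word, score) pairs.

    topic_word: {topic ID: {word: probability}}
    doc_topic:  {topic ID: probability of the topic in this document}
    """
    scores = {}
    for topic, p_topic in doc_topic.items():
        if p_topic < min_topic_prob:
            continue  # below the threshold: treated as probability 0
        for word, p_word in topic_word[topic].items():
            scores[word] = scores.get(word, 0.0) + p_word * p_topic
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Toy model with two topics (made-up values).
topic_word = {
    "T1": {"design": 0.5, "space": 0.3, "users": 0.2},
    "T2": {"volume": 0.6, "large": 0.3, "design": 0.1},
}
doc_topic = {"T1": 0.7, "T2": 0.3}

top3 = top_words_for_document(topic_word, doc_topic, k=3)
top3_words = [word for word, _ in top3]
```

Here the word design scores 0.5 x 0.7 + 0.1 x 0.3 = 0.38 because it receives a contribution from both topics.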

Ranking vs Probability

The parallel coordinates can display topic-document relations
based on the rank of the topic (rank 1 being the most probable
topic) or based on the actual probability of the topic. The
current design displays the rank because it is easier for the
user to interpret. As the differences in probability between
topics in a document’s distribution shrink, it becomes harder
for users to determine, on the parallel coordinates scale,
which topics are more important than others across a set of
documents.
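
The trade-off between the two modes can be seen in a short sketch. The helper `to_ranks` and its index-based tie-breaking are assumptions for illustration, and the two distributions are made up.

```python
def to_ranks(probs):
    """Convert a list of topic probabilities to ranks (1 = most probable).

    Ties are broken by topic index, which is an assumed convention.
    """
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    ranks = [0] * len(probs)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

# One strongly peaked and one nearly flat topic distribution.
peaked = [0.90, 0.06, 0.04]
nearly_flat = [0.34, 0.33, 0.33]

peaked_ranks = to_ranks(peaked)
flat_ranks = to_ranks(nearly_flat)
```

Both documents receive the same ranks, so they remain equally easy to read on a rank axis, whereas on a probability axis the three values of the nearly flat document would almost coincide. The price is that the rank view hides how large the gap between topics actually is.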

Eliminate High Frequency Words

The lists of top words for each document and topic contain many
words that are common to most documents, such as data and
visualization, as seen in figure 2. To alleviate this problem,
Term Frequency - Inverse Document Frequency (TF-IDF) [18] is
used, albeit not directly in combination with Latent Dirichlet
Allocation (LDA). TF-IDF is used to remove terms that occur
with high frequency across a large subset of the documents. It
is executed with an arbitrary threshold t, which describes the
maximum fraction of documents in the corpus in which a word may
be present. Words over the threshold are removed from the final
list of tokens before performing LDA.
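
The thresholding step can be illustrated with the simplified sketch below. It uses only the document-frequency ratio rather than a full TF-IDF weighting, which is a simplification; the function name, the default value of `t`, and the toy corpus are assumptions.

```python
from collections import Counter

def remove_frequent_words(tokenized_docs, t=0.5):
    """Drop every token that occurs in more than a fraction `t` of the
    documents. Returns the cleaned documents and the removed words."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    limit = t * len(tokenized_docs)
    too_common = {word for word, df in doc_freq.items() if df > limit}
    cleaned = [[w for w in tokens if w not in too_common]
               for tokens in tokenized_docs]
    return cleaned, too_common

# Toy corpus of four tokenized abstracts (made-up tokens).
tokenized_docs = [
    ["data", "visualization", "volume"],
    ["data", "graph", "layout"],
    ["data", "visualization", "flow"],
    ["data", "visualization", "tree"],
]

cleaned, removed = remove_frequent_words(tokenized_docs, t=0.5)
```

With t = 0.5, the words data (4 of 4 documents) and visualization (3 of 4) exceed the limit and are removed, while the more specific words remain as input for LDA.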


PILOT STUDY 
Study Details 

The pilot study is an initial study designed to gauge whether
users can perform the tasks outlined and how usable the tool
really is, and to get early feedback about the tool. It has
been conducted with 5 external participants. The responses from
the first 3 participants have been used to make some changes to
the questions, and further feedback has then been collected
from the remaining 2 participants. Two of the participants have
prior knowledge of topic modeling and how LDA works. For the
purpose of the study, the users have been asked to work with
the tool on a standard widescreen monitor. The study has been
partitioned into 4 parts: the overview, topic visualization,
filtering, and keyword searching.

The questions in the study are of 4 main types: task execution
questions (such as How many documents are represented in this
visual?), understanding questions (such as Does the visual have
too many things on it?), reasoning questions (such as Which is
the least important topic? (Important means highly ranked
topic)), and usability questions (such as Are the rankings on
the axis visible / readable?).

We provide details on how to access the tool and a brief
overview of its features with the help of an annotated example
diagram (such as figure 2). The users are also given
information about the data set they will be visualizing. For
the purpose of our study, the data set consists of the
abstracts of 322 research papers from the domain of Data
Visualization. Each abstract is considered a separate document.
The information contains attributes such as the title and the
abstract of each research paper. LDA is used to identify the
plausible underlying topic or topics for each research paper.
The title of each paper is shown in the “Documents” column. The
next section describes each part of the study and the
associated results.

Survey & Results 

Overview 

The overview questions are designed to determine whether
participants can glean any useful information immediately after
loading the main page and to check what they think about the
overall look and feel of the tool. The main page shows all the
data without any kind of filtering.

The users find the initially displayed parallel coordinates to
be slightly cluttered when presented without any filtering.
The users also find it difficult to locate markings on the
axes of the parallel coordinates, such as axis legends. The
overall information on the number of topics and the number of
documents represented in the visual can be deduced easily.

Visualizing Topics

The topic questions are designed to see if the participants can
understand what kinds of topics are generated by LDA and which
words (with their individual probabilities) are associated with
each topic. This section is very important because, while
inspecting LDA results, users are likely to investigate what
topics are produced from a corpus of documents.



The users are able to understand which topics have which words
and which words are important within these topics. The results
confirm that the treemap is useful for visualizing topics and
that using the areas of rectangles to indicate which topic or
word is more or less important (larger size being more
important) is useful.

Filtering 

Filtering questions are designed to see if users can use all
the features within LDAExplore to effectively filter documents
and find specific patterns. Users are tasked with filtering a
specific topic’s axis on the parallel coordinates for a range
of ranks to isolate the subset of documents that find the topic
to be important. Users are then asked to find topics that might
be similar and documents (research papers) that can be grouped
together. The study shows that users are able to filter the
document corpus easily. Some users find it difficult to deduce
which topic is more or less important. This is because the
parallel coordinates design has a single line for each document
and does not bundle lines together at every rank of each topic,
which would give the user an idea of the number of documents at
each rank.

Keyword Search 

The last set of questions is on the usefulness of keyword
search. Users are able to see the effect of filtering using
keywords on other sections (documents & search) of the tool.
Searching is one of the more useful features of the tool
because it instantaneously reduces the number of lines in the
parallel coordinates and provides immediate information about
documents based on the user’s search query.

Recommended Changes to the Design 

Based on the information gathered from the study, we have a
list of changes that can be made. To reduce the clutter in the
parallel coordinates (for large corpora), edge bundling can be
introduced [13]. To de-clutter it further, documents can be
clustered to visualize the topics that are ranked for these
groups. The topics themselves can be grouped together based on
their similarity. The navigation between topics in the treemap
can also be improved.

Users in the study find it difficult to understand the
“Documents” column. The column lists the titles of documents
and acts as a filter to remove a document from the parallel
coordinates rather than selectively show it. This is
counter-intuitive given how people use URLs or click on an
item, which is mainly to select a page or the highlighted item.
Hence this filter can be removed or altered to select a
specific document and filter out the rest of the corpus (or
highlight the specific document with respect to the others).

The treemap (topics) and the parallel coordinates (topic
distribution) are disconnected. This can be modified so that
the treemap serves as an additional filter, where the documents
across the parallel coordinates are filtered using topic words.
To find out which topics have similar probabilities in the
topic distribution for a given document, the user can be given
the option to switch between the rank and probability modes in
the visualization.


Searching using words is useful, but there is no mechanism to
search the corpus for a specific document using attributes such
as the title. Searching can be improved by adding the title of
the document to the keyword search section.

Once the documents have been filtered, users need a way to save
the selection so that it can be used at a later time. Such
in-memory storage of the applied filter can be useful.

CONCLUSION 

Being able to understand LDA results through a visualization is
a critical part of helping end users explore documents and
recognize patterns. This paper describes LDAExplore, a tool to
visualize a document corpus. It reports the results of a pilot
study which shows that the targeted audience gains a better
understanding of what the corpus contains and what the LDA
results mean.

We have worked closely with 5 individuals, 2 of whom understand
LDA. They are able to navigate through the visualization with
minimal guidance in order to solve tasks, such as finding
documents related to specific topics, that are difficult to do
without a visualization.

In the future we plan to investigate how to enable users to
cluster topics if they decide that the topics selected are not
distinct enough. Similarly, users may want to cluster documents
whose topics look similar and see how these aggregated clusters
differ. User-defined clustering changes the underlying model
within LDA. It will be interesting to see how the changes in
the model affect other topics or documents.